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Abstract —Optimal control synthesis in stochastic systems 
with respect to quantitative temporal logic constraints can 
be formulated as linear programming problems. However, 
centralized synthesis algorithms do not scale to many practical 
systems. To tackle this issue, we propose a decomposition-based 
distributed synthesis algorithm. By decomposing a large-scale 
stochastic system modeled as a Markov decision process into 
a collection of Interacting sub-systems, the original control 
problem is formulated as a linear programming problem with 
a sparse constraint matrix, which can be solved through 
distributed optimization methods. Additionally, we propose 
a decomposition algorithm which automatically exploits, if 
exists, the modular structure in a given large-scale system. 
We Illustrate the proposed methods through robotic motion 
planning examples. 

1. Introduction 

For many systems, temporal logic formulas are used to 
describe desirable system properties such as safety, stability, 
and liveness [1]. Given a stochastic system modeled as a 
Markov decision process (MDP), the synthesis problem is 
to find a policy that achieves optimal performance under a 
given quantitative criterion regarding given temporal logic 
formulas. For instance, the objective may be to find a 
policy that maximizes the probability of satisfying a given 
temporal logic formula. In such a problem, we need to 
keep track of the evolution of state variables that capture 
system dynamics as well as predicate variables that encode 
properties associated with the temporal logic constraints [2], 
[3]. As the number of states grows exponentially in the 
number of variables, we often encounter large MDPs, for 
which the synthesis problems are impractical to solve with 
centralized methods. The insight for control synthesis of 
large-scale systems is to exploit the modular structure in a 
system so that we can solve the original problem by solving 
a set of small subproblems. 

In literature, distributed control synthesis methods are pro¬ 
posed in the pioneering work for MDPs with discounted re¬ 
wards [4], [5]. The authors formulate a two-stage distributed 
reinforcement learning method; The first stage constructs and 
solves an abstract problem derived from the original one, and 
the second stage iteratively computes parameters for local 
problems until the collection of local problems’ solutions 
converge to one that solves the original problem. Recently, 
alternating direction method of multipliers (ADMM) is com¬ 
bined with a sub-gradient method into planning for average- 
reward problems in large MDPs in [6]. However, the method 
in [6] applies only when some special conditions are satisfied 
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on the costs and transition kernels. Alternatively, hierarchical 
reinforcement learning introduces action-aggregation and 
action-hierarchies to address the planning problems with 
large MDPs [7]. In action-aggregation, a micro-action is a 
local policy for a subset of states and the global optimal 
policy maps histories of states into micro-actions. However, 
it is not always clear how to define the action hierarchies and 
how the choice of hierarchies affects the optimality in the 
global policy. Additionally, the aforementioned methods are 
in general difficult to implement and cannot handle temporal 
logic specifications. 

For synthesis problems in MDPs with quantitative tem¬ 
poral logic constraints, centralized methods and tools [3], 
[8] are developed and applied to control design of stochastic 
systems and robotic motion planning [9]-[ll]. Since central¬ 
ized algorithms are based on either value iteration or linear 
programming, they inevitably hit the barrier of scalability 
and are not viable for large MDPs. 

In this paper, we develop a distributed optimization 
method for large MDPs subject to temporal logic contraints. 
We first introduce a decomposition method for large MDPs 
and prove a property in such a decomposition that supports 
the application of the proposed distributed optimization. 
For a subclass of MDPs whose graph structures are Planar 
graphs, we introduce an efficient decomposition algorithm 
that exploits the modular structure for the underlying MDP 
caused by loose coupling between subsets of states and its 
constituting components. Then, given a decomposition of 
the original system, we employ a distributed optimization 
method called block splitting algorithm [12] to solve the 
planning problem with respect to discounted-reward objec¬ 
tives in large MDPs and average-reward objectives in large 
ergodic MDPs. Comparing to two-stage methods in [5], [6], 
[13], our method concurrently solves the set of sub-problems 
and penalizes solutions’ mismatches in one step during each 
iteration, and is easy to implement. Since the distributed 
control synthesis is independent from the way how a large 
MDP is decomposed, any decomposition method can be 
used. Lastly, we extend the method to solve the synthesis 
problems for MDPs with two classes of quantative temporal 
logic objectives. Through case studies we investigate the 
performance and effectiveness of the proposed method. 

II. Preliminaries 

Let E be a finite set. Let E*, E‘^ be the set of finite and 
infinite words over E. card(E) is the cardinality of the set 
E. A probability distribution on a finite set S' is a function 
Z? : S —)■ [0,1] such that ^('®) = 1- 'The support of 



D is the set Supp(_D) = {s S S' | D{s) > 0}. The set of 
probability distributions on a finite set S is denoted I?(S). 

Markov decision process: A Markov decision process 
(MDP) M = (S, A, Mo, P) consists of a finite set S of states, 
a finite set A of actions, an initial distribution uq S T>(S) 
of states, and a transition probability function P : S x A —> 
I?(S) that for a state s G S and an action a G A gives 
the probability P{s,a){s') of the next state s'. Given an 
MDP M we define the set of actions enabled at state s as 
A(s) = {a G A I 3s' S S,P{s,a){s') > 0}. The cardinality 
of the set {(s, a) \ s G S, a G A(s)} is the number of state- 
action pairs in the MDP. 

Policies; A path is an infinite sequence sqSi ... of states such 
that for all i > 0, there exists a G A, s^+i G Supp(P(si, a)). 
A policy is a function f : S* —>■ D(A) that, given a finite 
state sequence representing the history, chooses a probability 
distribution over the set A of actions. Policy / is memoryless 
if it only depends on the current state, i.e., f : S —>■ 2?(A). 
Once a policy / is chosen, an MDP M is reduced to a 
Markov chain, denoted . We denote by Xi and Oi the 
random variables for the z-th state and the i-th action in this 
chain . Given a policy /, for a measurable function (p 
that maps paths into reals, let EI^ [p] (resp. [p]) be the 
expected value of p when the policy / is used given uq being 
the initial distribution of states (resp. s being the initial state). 

Rewards and objectives: Given an MDP M, a reward 
function R : S' x A —> K and a policy /, let 7 be a discounting 
factor, the discounted-reward value is defined as Val:^(uo) = 
Val:^(s) • ■uo(s) where Val:((s) = pf {J2n=o^'^RiXn, On)); 
the average-reward value is defined as Val'^(Mo) = Val'^(s) • 
Mo(s) where Val'^(s) = lim„^oo ELo 
A discounted-reward (resp. an average-reward) problem is, 
for a given initial state distribution, to obtain a policy 
that maximizes the discounted-reward value (resp. average- 
reward value). For discounted-reward (average-reward) prob¬ 
lems, the optimal value can be attained by memoryless 
policies [14]. 

A solution to the discounted-reward problem can be found 
by solving the linear programming (LP) problem: 


max > > xis.a) ■ Ris.a) (la) 

sGS a^A{v) 

subject to (lb) 

^ a;(s,a)- 7 -^ ^ x{sPa') ■ P{sPa'){s) 

aGA(s) s'^S a'^A{s') 

= mo(s),VsGS, (Ic) 


where m is the total number of state-action pairs in the 
MDP, K™ is the non-negative orthant of K™, and variable 
x{s,a) can be interpreted as the expected discounted time 
of being in state s and taking action a. Once the LP 
problem in ([T]| is solved, the optimal policy is obtained as 
f(s,a) = ^ — x(s,a ) — objective function’s value 

l^a'eA(s) ) 

is the optimal discounted-reward value under policy / given 
the initial distribution uq of states. 


In an ergodic MDP, the average-reward value is a constant 
regardless of the initial state distribution [15] (see [15] for 
the definition of ergodicity). We obtain an optimal policy for 
an average-reward problem by solving the LP problem 


max 
xG 


X! x{s,a) ■ R{s,a) (2a) 

sGS a^A{s) 

subject to 

^ x{s,a)-'^ ^ x(s',a') • P(s',a')(s) = 0, 


aGA(s) s'GSa'^A{s') 

Vs G S, 

sGS aGA{s) 


(2b) 

( 2 c) 


where x(s, o) is understood as the long-run fraction of time 
that the system is at state s and the action a is taken. Once the 
LP problem in ® is solved, the optimal policy is obtained as 
/(s, a) = ^ a') • optimal objective value is the 

optimal average-reward value and is the same for all states. 

Distributed optimization: As a prelude to the distributed 
synthesis method developed in section [TV] now we describe 
the alternating direction method of multipliers (ADMM) [16] 
for the generic convex constrained minimization problem 
minjgc g{z) where function g is closed proper convex and 
set C is closed nonempty convex. In iteration k of the 
ADMM algorithm the following updates are performed; 


.k+1/2 

■■= proXg(z'' - 

(3a) 

zk +1 

+ S^), 

(3b) 

~k+l 

:= -b 

(3c) 


where and are auxiliary variables. He is 

the (Euclidean) projection onto C, and proXg(M) = 
argmin^, {g{x) 3- {p/2)\\x — m|||) is the proximal operator 
[16] of g with parameter p > 0. The algorithm handles 
separately the objective function g in ( [^ and the constraint 
set C in ( [Jb] ). In ( [3^ the dual update step coordinates these 
two steps and results in convergence to a solution of the 
original problem. 

Temporal logic: Linear temporal logic (LTL) formulas are 
defined by; p •= p \ ^p \ pi V p 2 \ OP \ piUp 2 , where 
p G AV is an atomic proposition, and Q U are temporal 
modal operators for “next” and “until”. Additional temporal 
logic operators are derived from basic ones: Pp := traeUp 
(eventually) and Dp := -^P^p. Given an MDP M, let AP 
be a finite set of atomic propositions, and a function L : 
S -G 2^^ be a labeling function that assigns a set of atomic 
propositions L{s) C AV to each state s G S that are valid at 
the state s. L can be extended to paths in the usual way, i.e., 
L{sp) = L{s)L{p) for s G p G S‘^. A path p = soSi • • ■ G 
satisfies a temporal logic formula p if and only if L{p) 
satisfies p. In an MDP M, a policy / induces a probability 
distribution over paths in S^. The probability of satisfying 
an LTL formula p is the sum of probabilities of all paths 
that satisfy p in the induced Markov chain MV 




Problem 1: Given an MDP M and an LTL formula ip, 
synthesize a policy that optimizes a quantitative performance 
measure with respect to the formula ip in the MDP M. 

We consider the probability of satisfying a temporal logic 
formula as one quantitative performance measure. We also 
consider the expected frequency of satisfying certain recur¬ 
rent properties specihed in an LTL formula. 

By formulating a product MDP that incorporates the 
underlying dynamics of a given MDP and its temporal 
logic specification, it can be shown that Problem [T] with 
different quantitative performance measures can be formu¬ 
lated through pre-processing as special cases of discounted- 
reward and average-reward problems [3], [17]. Thus, in the 
following, we first introduce decomposition-based distributed 
synthesis methods for large MDPs with discounted-reward 
and average-reward criteria. Then, we show the extension for 
solving MDPs with quantitative temporal logic constraints. 

III. Decomposition of an MDP 
A. Decomposition and its property 

To exploit the modular structure of a given MDP, the 
initial step is to decompose the state space into small subsets 
of states, each of which can then be related to a small 
problem. In this section, we introduce some terminologies 
in decomposition of MDPs from [5]. 

Given an MDP M = {S, A, uq, P), let 11 be any partition 
of the state set S. That is, 11 = {S'!,..., Si\f} C 2'®, 0 ^ 11, 
Si n Sj =0 when i ^ j and IJ^i Si = S. A set in 11 
is called a region. The periphery of a region Si is a set 
of states outside Si, each of which can be reached with a 
non-zero probability by taking some action from a state in 
Si- Fomally, Periphery(S'i) = {s' G S \ Si \ 3(s, a) G 
Si X A, P(s, > 0}- 

Let Kq = Ui=i Periphery(S'i). Given a region Si G 11, 
we call Ki = Si \ Kg the kernel of Si. We denote rrii 
the number of state-action pairs restricted to Ki, for each 
i — 0,... ,N. That is, rrii is the cardinality of the set 
{(s,a) I s G Ki,a G ^(s)}. We call the partition {Ki \ 
0 < i < N} a decomposition of M. The following property 
of a decomposition is exploited in distributed optimization. 

Lemma 1: Given a decomposition {Ki, i = 0,1,..., N} 
obtained from partition 11 = {S'!,..., Sn}, for a state s G 
Ki where j 0, if there is a state s' and an action a such 
that P{s',a){s) ^ 0, then either s' G Kq or s' G Ki . 

Proof: Suppose s' ^ Kq and s' Ki, then it must 
be the case that s' G Kj for some j f 0 and j f i. Since 
from state s' G Sj, after taking action a, the probability 
of reaching s G 5^ is non-zero, we can conclude that 
s G Periphery(S'j), which implies s G Kg. The implication 
contradicts the fact that s G Ki since Kq H Ki = id. Hence, 
either s' G Ki or s' G Kq . ■ 

Example: Consider the MDP in Figure [T] which is taken 
from [18]. The shaded region shows a partition 11 = 
{S'! = { 34 , 55 , 361,52 = { 3 o, 3 i, 32 , 33, 37}} of the State 
space. Then, Periphery(5i) = { 32 , 37 } and Periphery( 52 ) = 
{34}. We obtain a decomposition of M as Kq = 
U-^iPeriphery(5i) = {52, 34, 37}, Ki = Si \ Kq = {55,36}, 


and K 2 = S 2 \Kq = {sq, 5 i, 33}. It is observed that state S5 
can only be reached with non-zero probabilities by actions 
taken from states .s., and .sk. 



Fig. 1: Example of an MDP with states Q = { 3 ^,^ = 
0, ...,7}, actions A = {a,/3}, and transition probability 
function P as indicated. 

B. A decomposition method for a subclass of MDPs 

Various methods are developed to derive a decomposi¬ 
tion of an MDP, for example, decompositions based on 
partitioning the state space of an MDP according to the 
communicating classes in the induced graph (defined in the 
following) of that MDP (see a survey in [13]). For the 
distributed synthesis method developed in this paper, it will 
be shown later in Section |IV] that the number of state-action 
pairs and the number of states in Ki are the number of 
variables and the number of constraints in a sub-problem, 
respectively. Thus, we prefer a decomposition that meets 
one simple desirable property: For each i = 0,1,...,(V, 
the number rrii of state-action pairs in Ki is small in the 
sense that the classical linear programming algorithm can 
efficiently solve an MDP with state-action pairs of this size. 
Next, we propose a method that generates decompositions 
which meet the aforementioned desirable property for a 
subclass of MDPs. For an MDP in this subclass, its induced 
graph is Planar It can be shown that MDPs derived from 
classical gridworld examples, which have many practical 
applications in robotic motion planning, are in this subclass. 

We start by relating an MDP with a directed graph. 

Definition 1: The labeled digraph induced from an MDP 
M = {S,A,uq,P) is a tuple G = {S,E) where 5 is a set 
of nodes, and i? C S' x A x 5 is a set of labeled edges such 
that (s,a,s') G K if and only if P{s,a){s') > 0. 

Let n = card(S) be the total number of nodes in the graph. 
A partition of states in the MDP gives rise to a partition 
of nodes in the graph. Given a partition 11 and a region 
Si G n, a node is said to be contained in Si if some edge of 
the region is incident to the node. A node contained in more 
than one regions is called a boundary node. That is, s G Si 
is a boundary node if and only if there exists (s, a, 3') G E 
or (s', a,s) G E with s' Si. Formally, the boundary nodes 

*A graph is planar if it can be drawn in the plane in such a way that no 
two edges meet each other except at a vertex to which they are incident. 









of Si are Bi = In^ U Outi where In^ = {s G S'i | 3j ^ 
i,s' € Sj, a G ^(s'), and (s',a, s) G E} and Outi = {s G 
Si I 3s' G S' \ Si, a G A(s), and (s, a, s') G E}. We define 
Bo = Uili Bi. Note that since IJ^i In* = Kq, Kq C Bq. 
We use the number of boundary nodes as an upper bound 
on the size of the set Kq of states. 

Definition 2: [20] An r-division of an n-node graph is a 
partition of nodes into 0{n/r) subsets, each of which have 
0{r) nodes and 0{^/r) boundary nodes. 

Reference [20] shows an algorithm that divides a planar 
graph of n vertices into an r-division in 0(n log n) time. 

Lemma 2: Given a partition If of an MDP with n states 
obtained with a r-division the induced graph, the number of 
states in Kq is upper bounded by 0{nly/r) and the number 
of states in Ki is upper bounded by 0{r). 

Proof: Since each boundary node is contained in at most 
three regions and at least one region by the property of 
an r-division [20], the total number of boundary nodes is 
0{y/r- ") = 0{n/y/r). The number of states in Ki is upper 
bounded by the size of Si, which is 0(r). ■ 

To obtain a decomposition, the user specifies an approxi¬ 
mately upper bound on the number of variables for all sub¬ 
problems. Then, the algorithm decides whether there is an 
r-division for some r that gives rise to a decomposition that 
has the desirable property. 

Remark 1: Although the decomposition method proposed 
here is applicable for a subclass of MDPs, the distributed 
synthesis method developed in this paper does not con¬ 
strain the way by which a decomposition is obtained. A 
decomposition may be given or obtained straight-forwardly 
by exploiting the existing modular structure of the system. 
Even if a decomposition does not meet the desirable property 
for the distributed synthesis method, the proposed method 
still applies as long as each subproblem derived from that 
decomposition (see Section |1V-C| | can be solved given the 
limitation in memory and computational capacities. 

IV. Distributed synthesis: Discounted-reward 

AND AVERAGE-REWARD PROBLEMS 

In this section, we show that under a decomposition, the 
original LP problem for a discounted-reward or average- 
reward case can be formulated into one with a sparse 
constraint matrix. Then, we employ block-splitting algorithm 
based on ADMM in [12] for solving the LP problem in a 
distributed manner. 

A. Discounted-reward case 

Given a decomposition {Ki \ i = 0,1,...,W} of an 
MDP, let Xi be a vector consisting of variables x(s, a) for 
all s & Ki with all actions enabled from s. Let ti : Ki —>■ 
{1,... ,card(Ari)} be an index function. The constraints in 
( [T^ can be written as: Lor each s G Kq, 

x(s, a) = uo(s)-f 

aG-A(s) 

N 

x{s',a) ■ P{s',a){s), (4) 

2—0 s'^Ki a'^A{s') 


and for each s G Ki, i = 1,... ,N, 
x(s, a) = uo(s)-f 

aG A(s) 

I'iYl a:(s',a') •P(s',a')(s) 

s' ^Kq a'£A{s') 

+ x(s',a') • P(s',a')(s)). (5) 

s'GKi a'£A{s') 

Recall that, in Lemma [T] we have proven that each s & Ki 
with i Q can only be reached with non-zero probabilities 
from states in Kq and Ki. As a result, for each state s in 
Ki with i 7 ^ 0 and each action a G A(s), the constraint 
on variable x(s, a) is only related with variables in Xi and 
xq. Let X = {xq,Xi, ... ,xn). We denote the number of 
variables in Xi by rrii and the number of states in the set Ki 
by rii. Let m = J2f=o The LP problem in Q is then 

N 

min y^ cjXj , subject to Ax = b, (6) 

+ j=0 

where c[xj = Zsgk, 'EaeA(s) -R{s,a)x{s,a), 



"Aoo Aoi Ao 2 ... Aon' 


'bo' 


Aio All 


bi 

A = 

A20 A22 0 

, b = 

bo 


; 0 




-Ajvo Ann. 


.bN. 


and bi G K"' where bi{k) = uq{s) if Li{s) = k. The 
transformation from Q and (0 to is straightforward by 
rewriting the constraints and we omit the detail. 


B. Average-reward case 

Lor an ergodic MDP, the constraints in the LP problem 
of maximizing the average reward, described by ( [Sb] ), can be 
rewritten in the way just as how © is rewritten into Q and 
0 for the discounted-reward problem. The difference is that 
for the average-reward case, we let 7 = 1 and replace uq{s) 
with 0, for all s G S', in Q and Q. An additional constraint 
for the average-rewai'd case is that J2seS J2aeA{s) x{s, a) = 
1. Hence, for an average-reward problem in an ergodic MDP, 
the corresponding LP problem in 0 is formulated as 

N 

min cjxi, subject to 
+ j=o 

where 1^ is a row vector of m ones, cjxj = 
'l2seK A2aeA(s) ~B{s,a)x{s,a) and the block-matrix A 
has the same structure as of that in the discounted case with 
7 = 1. We can compactly write the constraint in 0 as 
A'x = b where A! is a sparse constraint matrix similar in 
structure to the matrix A in the discounted-reward case. 


1^ 

A 


X = 


(7) 












C. Distributed optimization algorithm 

We solve the LP problems in (|^ and Q by employing the 
block splitting algorithm based on ADMM in [12]. We only 
present the algorithm for the discounted-reward case in (|^. 
The extension to the average-reward case is straight-forward. 

First, we introduce new variables yi and let fi{yi) = 
where for a convex set C, Ic is a function dehned 
by Ic{x) = 0 for X € C, Ic{x) = oo for x ^ C. Then, 
adding the term fi{yi) into the objective function enforces 
yt = h. Let gi{xi) = cjxi -f I^r^i{xi). The term I^^i{x^) 
enforces that xi is a non-negative vector. We rewrite the LP 
problem in (|^ as follows. 

N N 

min V/, (y,) + V 5, (a:*) 

x.y 

A (8) 

subject to j/o = 2. and for i = 1 ,..., 

i =0 

yi = AmXd + Aiix^. 

With this formulation, we modify the block splitting algo¬ 
rithm in [12] to solve ([^ in a parallel and distributed manner 
(see the Appendix for the details). The algorithm takes 
parameters p, and p > 0 is a penalty parameter 
to ensure the constraints are satisfied, e’"®* > 0 is a relative 
tolerance and > 0 is an absolute tolerance. The choice 
of and depends on the scale of variable values. In 
synthesis of MDPs, and may be chosen in the range 
of 10“^ to 10“®. The algorithm is ensured to converge with 
any choice of p and the value of p may affect the convergence 
rate. 

V. Extension to quantitative temporal logic 

CONSTRAINTS 

We now extend the distributed control synthesis methods 
for MDPs with discounted-reward and average-reward crite¬ 
ria to solve Problem [T] in which quantitative temporal logic 
constraints are enforced. 

A. Maximizing the probability of satisfying an LTL specifi¬ 
cation 

Preliminaries Given an LTL formula p as the system 
specification, one can always represent it by a deterministic 
Rabin automaton (DRA) A^p = (Q, 2^^, T,/, Acc) where 
Q is a hnite state set, 2^^ is the alphabet, I G Q is 
the initial state, and T : Q x 2-^^ —>^ Q the transition 
function. The acceptance condition Acc is a set of tuples 
{{Ji, Hi) G 2*5 X 2*5 I I = 1 , ...,^}. The run for an inhnite 
word w = o'o'Ti ■ ■ • G (2^^)'^ is the inhnite sequence of 
states qoQi ... G <5“ where qo = I and = T{qi, ai) for 
i > 0. A run p = q^qi ... is accepted in Ap if there exists 
at least one pair [Ji^Hf) G Acc such that Inf(p) n = 0 
and Inf(p) Hi % where Inf(p) is the set of states that 
appear inhnitely often in p. 

Given an MDP M = {S, A^uq, P) augmented with a set 
AV of atomic propositions and a labeling function L : S —t 
2^^, one can compute the product MDP Ai = M x Ap = 


(y, A, A, uo, Acc) with the components dehned as follows: 

V = S X Q is the set of states. A is the set of actions. 
The initial probability distribution of states is po : ^ 
[0,1] such that given v = {s,q) with q = T{I, L{s)), it is 
that po(v) = uo{s). A : y X A — >■ DiV) is the transition 
probability function. Given v = {s,q), a, v' = {s',q') and 
q' = T{q, L{s')), let A{v,a){v') = P{s,a){s'). The Rabin 
acceptance condition is Acc = {{Ji,Hi) \ Ji = S x Hi = 
SxH,fi = 

By construction, a path p = vqVi ... G y“ satishes the 
LTL formula ip if and only if there exists i G 
Inf(p) fi Ji = </} and Inf(p) f] Hi ib. To maximize the 
probability of satisfying p, the hrst step is to compute the 
set of end components in A4, each of which is a pair {W^ /) 
where PL C y is non-empty and / : PL —>^ 2"^ is a 
function such that for any v G W, for any a G f(v), 
A(v, a)(v') = 1 and the induced directed graph 
(py, —)•/) is strongly connected. Here, u —>■/ u' is an edge in 
the graph if there exists a G f{v), A{v,a){v') > 0. An 
end component (PF, /) is accepting if PL n = 0 and 
PL n 7 ^ 0 for some i G 

Let the set of accepting end components (AEC)s in Al be 
AEC(Al) and the set of accepting end states be C = {t; | 
3{W, f) G AEC(A4),v G py}. Once we enter some state 

V G C, we can hnd an AEG (py, /) such that v G W, and 

initiate the policy / such that for some i G states 

in Ji will be visited a hnite number of times and some state 
in Hi will be visited inhnitely often. 

Formulating the LP problem An optimal policy that max¬ 
imizes the probability of satisfying the specihcation also 
maximizes the probability of hitting the set of accepting 
end states C. Reference [21] develops GPU-based parallel 
algorithms which signihcantly speed up the computation of 
end components for large MDPs. After computing the set of 
AECs, we formulate the following LP problem to compute 
the optimal policy using the proposed decomposition and 
distributed synthesis method for discounted-reward cases. 

Given a product MDP Ai = (V, A, A, po, Acc) and the 
set C of accepting end states, the modihed product MDP is 
A4 = ((y\C)U{sink}, A, A,/io,i?) where (y\C)U{sink} 
is the set of states obtained by grouping states in C as a 
single state sink. Eor all a G A, A(sink,a)(sink) = 1 and 
A(u,a)(sink) = A{v,a){v'). The initial distribution 

fio of states is dehned as follows: Eor v G V \ C, flo{v) — 
Po{v), and /lo(sink) = Aoix)- The reward function 

R : ((y \ C) U {sink}) x A — )■ M is dehned such that for all 

V that is not sink, R{v,a) = E„'G(v\c)u{5ink} Mx,a){v') ■ 
l{sink}(i;') where lj5s:('y) is the indicator function that outputs 
1 if and only if w G A and 0 otherwise. Eor any action 
a G A(sink), i?(sink,a) = 0. 

By dehnition of reward function, the discounted reward 
with 7 = 1 from state v in the modihed product MDP Ai 
is the probability of reaching a state in C from v under 
policy / in the product MDP Ai. Hence, with a decom¬ 
position of AI, the proposed distributed synthesis method 
for discounted-reward problems can be used to compute the 


policy that maximizes the probability of satisfying a given 
LTL specification. 

B. Average reward under Biichi acceptance conditions 

Preliminaries Consider a temporal logic formula p that can 
be expressed as a deterministic Biichi automaton (DBA) 
I, F^) where / are defined 

similar to a DRA and C Q is a set of accepting states. 
A run p is accepted in A^p if and only if Inf(p) n ^ 0. 
Given an MDP M = {S, A, uq, P, AV, L) and a DBA Ap, = 
(Q, 2 ^^,T, qo,Fip), the product MDP with Biichi objective 
is Ai = M K Aip = {V, A, A, pq, F) where components 
V, A, A, po obtained similarly as in the product MDP 
with Rabin objective. The difference is that F C S x F^ 
is the set of accepting states. A path p = vqVi ... € V‘^ 
satisfies the LTL formula ip if and only if Inf(p) n F 7 ^ 0. 

Formulating the LP problem For a product MDP M. 
with Biichi objective, we aim to synthesize a policy that 
maximizes the expected frequency of visiting an accepting 
state in the product MDP Ai = M x Ap,. This type of 
objectives ensures some recurrent properties in the temporal 
logic formula are satisfied as frequently as possible. For 
example, one such objective can be requiring a mobile robot 
to maximize the frequency of visiting some critical regions. 

This type of objectives can be formulated as an average- 
reward problem in the following way: Let the reward function 
R : Fx A —>• K be defined by R(v, a) = J^y'ev 
If(u'). By definition of the reward function, the optimal 
policy with respect to the average-reward criterion is the one 
that maximizes the frequency of visiting a state in F. If the 
product MDP is ergodic, we can then solve the resulting 
average-reward problem by the distributed optimization al¬ 
gorithm with a decomposition of product MDP Ai. 


VI. Case studies 


We demonstrate the method with three robot motion plan¬ 
ning examples. All the experiments were run on a machine 
with Intel Xeon 4 GHz, 8 -core CPU and 64 GB RAM 
running Linux. The distributed optimization algorithm is 
implemented in MATLAB. The decomposition and other 
operations are implemented in Python. 

Figure shows a fraction of a gridworld. A robot 
moves in this gridworld with uncertainty in different terrains 
(‘grass’, ‘sand’, ‘gravel’ and ‘pavement’). In each terrain and 
for robot’s different action (heading north (‘N’), south (‘S’), 
west (‘W’) and east (‘E’)), the probability of arriving at the 
correct cell is 0.9 for pavement, 0.85 for grass, 0.8 for gravel 
and 0.75 for sand. With a relatively small probability, the 
robot will arrive at the cell adjacent to the intended one. 


Figure 2b displays a 20 x 20 gridworld. The grey area and 
the boundary are walls. If the robot runs into the wall, it will 
be bounce back to its original cell. The walls give rise to a 
natural partition of the state space, as demonstrated in this 
figure. If no explicit modular structure in the system can be 
found, one can compute a decomposition using the method 
in section |III-B| In the following example, the wall pattern 
is the same as in the 20 x 20 gridworld. 


NW ; N ; ME 


SW i S i SE 


(a) (b) 

Fig. 2: (a) A fraction of a to x n gridworld. The dash arrow 
represents that if the robot takes action ‘N’, there are non¬ 
zero probabilities for it to arrive at NW, N, and NE cells, 
(b) A 20 X 20 gridworld. A natural partition of state space 
using the walls gives rise to Kq, Ki, K 2 t subsets of 

states. States in Kq are enclosed using the squares. 



A. Discounted-reward case 


We select a subset W of cells as “restricted area” and a 
subset G of cells as “targets”. The reward function is given: 
Eor s & S, S ^ G \J W, R{s,a) = —1 counts for the 
amount of time the robot takes action a. Eor s G W, for all 
a G A(s), i?(s, a) = —1000. Eor s G G, R{s,a) = 100 for 
all a G A(s). Intuitively, this reward function will encourage 
the robot to reach the target with as fewer expected number 
of steps as possible, while avoiding running into a cell in the 
restricted area. We select 7 = 0.9. 

Case 1: To show the convergence and correctness of the 
distributed optimization algorithm, we first consider a 100 x 
100 gridworld example that can be solved directly with a 
centralized algorithm. Since at each cell there are four actions 
for the robot to pick, the total number of variables is 4 x 8220 
for the 100 x 100 gridworld (the wall cells are excluded 
from the set of states). In this gridworld, there is only 1 
target cell. The restricted area include 50 cells. The resulting 
LP problem Q can be solved using CVX, a package for 
specifying and solving convex programs [22]. The problem 
is solved in 4.77 seconds, and the optimal objective value 
under the optimal policy given by CVX is 10. 

Next, we solve the same problem by decomposing the 
state space of the MDP along the walls into 25 regions, 
each of which is a 20 x 20 gridworld. This partition of state 
space yields 75 states for each Ki,i > 0 and 720 states for 
Ko- In which follows, we select p = 80,100, 200, 500,1000 
to show the convergence of the distributed optimization 
algorithm. Irrespective of the choices for p, the average time 
for each iteration is about 0.16 sec. The solution accuracy 
relative to CVX is summarized in Table |I] The ‘rel. error 
in objvaP is the relative error in objective value attained, 
treating the CVX solution as the accurate one, and the 
infeasibility is the relative primal infeasibility of the solution, 
measured by • Eigure shows the convergence of 

the optimization algorithm. 

Case 2: Eor a 1000 x 100 gridworld, the centralized 
method in CVX fails to produce a solution for this large- 










































TABLE I; Relative resolution accuracy for solving the 100 x 
10 - 5 , = 10 - 4 ). 


p 

80 

100 

Iterations 

12001 

11014 

objval 

9.54 

9.90 

rel. error (%) 

4.6 

1.0 

infeasibility 

2.7 X 10-3 

2 X 10- 



Fig. 3; Relative error in the objective value vesus iterations in 
100 X100 gridworld with discounted reward, under p = 1000. 
For clarity, we did not draw the relative error for the initial 
2000 steps, which are comparatively large. 


scale problem. Thus, we consider to solve it using the 
decomposition and distributed synthesis method. In this 
example, we partition the gridworld such that each region 
has 50 X 50 cells, which results in 40 regions. There are 
1160 states in ATg and about 2005 states in each Ki, for 
j = 1,..., 40. In this example, we randomly select 40 cells 
to be the targets and 40 cells to be the restricted areas. By 
choosing p = 1000, e’’®* = 10-®, = IQ-®, the optimal 

policy is solved within 25342 seconds and it takes about 2.6 
seconds for one iteration. The total number of iterations is 
9747. Under the (approximately)-optimal policy obtained by 
distributed optimization, the objective value is 138.73. The 
relative primal infeasibility of the solution is 0.29 x lO-^. 
Figure shows the convergence of distributed optimization 
algorithm. 

B. Average-reward case with quantative temporal logic ob¬ 
jectives 

We consider a 50 x 50 gridworld with no obstacles and 
4 critical regions labeled “i?i” , “Aa”, “Ag” and 
The system is given a temporal logic specification ip := 
□0(7?i A OT^a) A □0(7?3 A '07?4). That is, the robot has 
to always eventually visit region Ri and then i?a, and 
also always eventually visit region i ?3 and then R 4 . The 
number of states in the corresponding DBA is 14 after 
trimming the unreachable states, due to the fact that the 
robot cannot be at two cells simultaneously. The quantitative 
objective is to maximize the frequency of visiting all four 
regions (an accepting state in the DBA). The formulated 
MDP is ergodic and therefore our method for average-reward 
problems applies. 

For an average-reward case, we need to satisfy the con¬ 
straint 1 ^ X = 1 in 0. This constraint leads to slow conver- 


100 gridworld example with discounted reward (under = 


200 

500 

1000 

13868 

11866 

12733 

9.89 

9.96 

9.96 

1.1 

0.38 

0.38 

,885 X 10-3 

0.45 X 10-3 

0.23 X 10 




Fig. 4: Objective value vesus iterations in (a) 1000 x 100 
gridworld (the initial 500 steps are omitted), (b) Objective 
value vesus iterations in 50 x 50 gridworld with a Biichi 
objective. Here we only show the first 1000 iterations as the 
objective value converges to the optimal one after 1000 steps. 

gence and policies with large infeasibility measures in dis¬ 
tributed optimization. To handle this issue, we approximate 
average reward with discounted reward [23]: For ergodic 
MDPs, the discounted accumulated reward, scaled by 1 — 7, 
is approximately the average reward. Further, if is large 
compared to the mixing time [15] of the Markov chain, then 
the policy that optimizes the discounted accumulated reward 
with the discounting factor 7 can achieve an approximately 
optimal average reward. 

Given p = 1000, = IQ-® and = 10-®, 7 = 

0.98, the distributed synthesis algorithm terminates in 14284 
iteration steps and the optimal discounted reward is 0.9998. 
Scaling by 1 — 7 = 0.02, we obtain the average reward 
0.9998 X 0.02 = 0.02, which is the approximately optimal 
value for this average reward under the obtained policy. The 
convergence result is shown in Figure |^and the infeasibility 
measure of the obtained solution is 0.016. 

VII. Conclusion 

For solving large Markov decision process models of 
stochastic systems with temporal logic specifications, we 








developed a decomposition algorithm and a distributed syn¬ 
thesis method. This decomposition exploits the modularity 
in the system structure and deals with sub-problems of 
smaller sizes. We employed the block splitting algorithm 
in distributed optimization based on the alternating direc¬ 
tion method of multipliers to cope with the difficulty of 
combining the solutions of sub-problems into a solution to 
the original problem. Moreover, the formal decomposition- 
based distributed control synthesis framework established 
in this paper facilitates the application of other distributed 
and parallel large-scale optimization algorithms [24] to fur¬ 
ther improve the rate of convergence and the feasibility 
of solutions for control synthesis in large MDPs. In the 
future, we will develop an interface to PRISM toolbox [8] 
with an implementation of the proposed decomposition and 
distributed synthesis algorithms. 


Appendix 

At the fc-th iteration, for i,j = 0,..., N, 

_ prox^.(2/f - yf) = bi, 
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where proj^^i denotes the projection to the nonnegative 
orthant, proj^^ denotes projection onto {(x, y) \ y = 
Aijx}. avg is the elementwise averaging R and exch is 
the exchange operator, defined as below. exch(c, 
is given by := Cj -f (c - - 1) and 

Di := c — (c — '^f=iCj)/N — 1. The variables can be 
initialized to 0 at A: = 0. Note that the computation in each 
iteration can be parallelized. The iteration terminates when 
the stopping criterion for the block splitting algorithm is met 
(See [12] for more details). The solution can be obtained 

' ■ k+l/ 2 s^ 


- (^k+ 1/2 


X* = (x, 


'-N 
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